Author: Aleix López Pascual
The main libraries involved in this project are:
import numpy as np
import pandas as pd
pd.options.display.max_columns = 50
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.linear_model import LogisticRegression
# We tweak the plotting style of this notebook (label sizes, figure size and dpi).
import pylab as plot
params = {
    'axes.labelsize': 'large',
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
plot.rcParams.update(params)
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
In this section we load both datasets and take a first look at their structure.
df_lt = pd.read_csv('C:/Users/Aleix Lopez/Desktop/Data Science Test/Courier_lifetime_data.csv')
df_w = pd.read_csv('C:/Users/Aleix Lopez/Desktop/Data Science Test/Courier_weekly_data.csv')
print('df_lt shape:', df_lt.shape)
print('df_w shape:', df_w.shape)
We start visualizing small samples of both datasets.
df_lt.head()
df_w.head()
We also generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset distributions, excluding NaN values. We analyze both numeric and object series.
df_lt.describe()
df_lt.describe(include=['O'])
df_w.describe()
df = pd.merge(df_w, df_lt, how='left', on='courier', suffixes=('', '_lt'))
df.head()
print('df shape:', df.shape)
df.describe()
df.describe(include=['O'])
print('Number of missing values in feature_2_lt:', df.feature_2_lt.isnull().sum())
print('Number of unique values in feature_16:', df.feature_16.nunique())
print('Most frequent value in feature_16:', df.feature_16.value_counts().idxmax())
print('Frequency of the most frequent value in feature_16:', df.feature_16.value_counts().max())
We are going to detect and correct corrupt or missing records from the database. We have seen that the only column which contains missing values is feature_2_lt. However, there may be outliers in any of the features.
print('Number of unique values in feature_2_lt:', df.feature_2_lt.nunique())
df.feature_2_lt.value_counts().sort_index()
After examining the possible values, we identify three outliers: -61, 0 and 1. We remove them by replacing them with NaN.
df.loc[(df.feature_2_lt == -61) | (df.feature_2_lt == 0) | (df.feature_2_lt == 1),['feature_2_lt']] = np.nan
Now, we describe the feature again.
df.feature_2_lt.describe()
print('Number of unique values in feature_2_lt:', df.feature_2_lt.nunique())
print('Number of missing values in feature_2_lt:', df.feature_2_lt.isnull().sum())
We show the univariate studies that we have performed to each feature in order to identify outliers.
sns.countplot(df.feature_1_lt, order = ['a','b','c','d'])
df.feature_1_lt.value_counts().sort_index()
Even though the value 'c' has few counts, we do not consider it an outlier, since the variable is categorical and follows the pattern [a, b, c, d].
sns.distplot(df.feature_1)
sns.distplot(df.feature_2)
sns.distplot(df.feature_3)
sns.distplot(df.feature_4)
sns.distplot(df.feature_5)
sns.distplot(df.feature_6)
sns.distplot(df.feature_7)
sns.distplot(df.feature_8)
sns.distplot(df.feature_9)
sns.distplot(df.feature_10)
sns.distplot(df.feature_11)
sns.distplot(df.feature_12)
sns.distplot(df.feature_13)
sns.distplot(df.feature_14)
sns.distplot(df.feature_15)
sns.countplot(df.feature_16)
sns.distplot(df.feature_17)
We conclude that no other feature apart from feature_2_lt needs to be corrected.
Once we have corrected feature_2_lt, we proceed to treat the missing values. There are different methods. For instance, a simple baseline is to fill each gap with a random number drawn from the interval [mean - std, mean + std]. However, a more accurate way of guessing missing values is to use other correlated features. Therefore, we first find which features correlate with feature_2_lt. The best way to do this is a correlation plot, which measures the correlation between every pair of variables in the dataset and displays the result as a color map.
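For reference, the simple random-imputation baseline mentioned above can be sketched as follows (a toy series stands in for feature_2_lt; the values are illustrative, not from our data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy series standing in for feature_2_lt, with two missing entries.
s = pd.Series([24.0, 30.0, np.nan, 28.0, np.nan, 26.0])

mean, std = s.mean(), s.std()  # NaNs are skipped by default
# Draw one uniform value in [mean - std, mean + std] per missing entry.
fill_values = rng.uniform(mean - std, mean + std, size=s.isnull().sum())
s.loc[s.isnull()] = fill_values

print(s.isnull().sum())  # 0
```

This baseline ignores the relationships between features, which is exactly why we prefer the correlation-based imputation below.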
sns.boxplot(x='feature_1_lt', y='feature_2_lt', data=df.dropna(), order=['a', 'b', 'c', 'd'])
sns.boxplot(x='feature_16', y='feature_2_lt', data=df.dropna())
df_corr = df.dropna().drop(columns=['courier','week','feature_16','feature_1_lt']).corr()
sns.heatmap(df_corr, annot=True)
Given these observations, we proceed to impute the missing feature_2_lt values.
grouped_median_df = df.groupby(['feature_1_lt']).median().reset_index()[['feature_1_lt','feature_2_lt']]
grouped_median_df.head()
def fill_feature_2_lt(row):
    condition = (grouped_median_df['feature_1_lt'] == row['feature_1_lt'])
    return grouped_median_df[condition]['feature_2_lt'].values[0]

df['feature_2_lt'] = df.apply(lambda row: fill_feature_2_lt(row)
                              if np.isnan(row['feature_2_lt']) else row['feature_2_lt'], axis=1)
print('Number of missing values in feature_2_lt:', df.feature_2_lt.isnull().sum())
df.feature_2_lt.describe()
The missing values have been replaced.
Once the data is clean, we proceed to create some interesting charts that will reveal correlations and hidden insights in the data. Since we are asked to label the data after the visualization process, we are not going to provide plots about the correlation between the features and the target variable. Instead, we are going to focus on identifying patterns among the features.
We begin by showing a correlation matrix as we did before, but now using our complete clean data. Notice that trying to represent a linear correlation between categorical variables and numerical variables does not make sense. Thus we have dropped columns 'feature_16' and 'feature_1_lt'. The correlation matrix gives us values between [-1,1], where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
df_corr = df.drop(columns=['courier','week','feature_16','feature_1_lt']).corr()
sns.heatmap(df_corr, annot=True)
Several feature pairs show strong positive linear correlation, while others show strong negative linear correlation.
There are features highly or perfectly correlated. Therefore, there is a high chance that the performance of the model will be impacted by the so-called Multicollinearity problem. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to misleading results. Since we will use a Logistic Regression model, which is not immune to that problem, we should fix it before training the model.
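A common multicollinearity diagnostic, not used in the original analysis but shown here as an illustrative sketch, is the variance inflation factor (VIF): VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The data below is synthetic; only the formula is standard.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 get very large VIFs; x3 stays near 1
```

A VIF well above 5-10 is the usual rule-of-thumb signal that a feature is redundant and a candidate for removal.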
Knowing the correlation between pairs gives us an insight of which sets of features we should study in detail by plotting bivariate or multivariate charts.
During the data cleaning section we saw the univariate plots of all features in order to identify outliers. As we have only modified 'feature_2_lt' since then, we are not going to repeat all plots.
sns.distplot(df.feature_2_lt)
sns.distplot(df.feature_3)
We begin by showing the correlation between the categorical variable feature_1_lt and all numerical features. We use violinplots in order to compare the different distributions.
sns.violinplot(x = 'feature_1_lt', y = 'feature_1', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_2', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_3', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_4', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_5', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_6', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_7', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_8', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_9', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_10', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_11', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_12', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_13', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_14', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_15', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_17', data = df, order=['a','b','c','d'])
sns.violinplot(x = 'feature_1_lt', y = 'feature_2_lt', data = df, order=['a','b','c','d'])
We show the correlation between the two categorical features of our data.
sns.countplot(x='feature_16', hue='feature_1_lt', data=df, hue_order=['a', 'b', 'c', 'd'])
plt.legend(loc='upper right', title='feature_1_lt')
We show the correlation between feature_2_lt and both categorical features.
sns.violinplot(x='feature_16', y='feature_2_lt', data=df)
facet_grid = sns.FacetGrid(df, col="feature_1_lt", row="feature_16", col_order=['a','b','c','d'])
facet_grid.map(sns.kdeplot, "feature_2_lt")
We show the correlation between the most correlated numerical features.
sns.pairplot(df[['feature_2', 'feature_3', 'feature_11']])
sns.pairplot(df[['feature_4', 'feature_5', 'feature_9']])
We are going to use the knowledge obtained from the exploratory data analysis to process and transform the features in such a way that the data becomes manageable by a machine learning algorithm.
The variable feature_1_lt is a nominal categorical variable. Our model algorithm can only handle numeric values and so we need to create a new variable (dummy variable) for every unique value of the categorical variable. This variable will have a value 1 if the row has a particular value and a value 0 if not.
df_feature_1_lt = pd.get_dummies(df.feature_1_lt, prefix='feature_1_lt')
df_feature_1_lt.head()
df = pd.concat([df, df_feature_1_lt], axis=1)
df.drop('feature_1_lt', axis=1, inplace=True)
df.head()
feature_16 is a categorical variable too, but it is an ordinal one. We saw that feature_16 presents a decreasing behaviour: counts decrease as feature_16 increases. Therefore, there is no need to transform it.
As we stated before, we need to fix the Multicollinearity problem before training the model. There are multiple ways to deal with it. The easiest is to delete one of the perfectly correlated features. Another way is to use a dimensionality reduction algorithm such as Principal Component Analysis (PCA). Since we have already identified the most correlated features, we are going to apply the manual method.
df.drop(columns = ['feature_4'], inplace = True)
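As an illustrative alternative to dropping columns by hand, PCA compresses linearly dependent columns automatically. The sketch below runs on synthetic data (not on df): five columns with effective rank three collapse to three components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Two extra columns that are exact linear functions of the first three.
redundant = np.column_stack([base[:, 0] + base[:, 1], base[:, 2] * 2.0])
X = np.column_stack([base, redundant])  # 5 columns, effective rank 3

pca = PCA(n_components=0.99)  # keep enough components for 99% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])  # 3
```

The trade-off is interpretability: the principal components are mixtures of the original features, whereas manually dropping feature_4 keeps every remaining column meaningful.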
As we have seen, there is no target variable in our datasets. All the variables are features that describe the couriers. However, we know we need labeled data in order to apply our supervised learning algorithm: the logistic regression model.
The labeling is performed as follows: If a specific courier’s week 9, 10 and 11 data is not provided, we label this courier as “1” otherwise “0”.
y = df.drop(columns=['feature_1', 'feature_2', 'feature_3', 'feature_5', 'feature_6',
                     'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11',
                     'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16',
                     'feature_17', 'feature_1_lt_a', 'feature_1_lt_b', 'feature_1_lt_c',
                     'feature_1_lt_d', 'feature_2_lt'])
# Create a new column which assigns 1 if week is 9, 10 or 11; 0 otherwise.
y['pre_label'] = np.where((y.week == 9) | (y.week == 10) | (y.week == 11), 1, 0)
y.drop(columns=['week'], inplace=True)
# Group by courier and sum the pre_label values.
y = y.groupby(['courier']).sum()
# If the sum is 3, data is present in weeks 9, 10 and 11, so the label is 0; otherwise 1.
y['label'] = np.where(y.pre_label == 3, 0, 1)
y.drop(columns=['pre_label'], inplace=True)
y.head()
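The same labeling rule can also be expressed more compactly with a single groupby; the sketch below checks it on a toy frame whose column names mirror the originals (the toy couriers are hypothetical):

```python
import pandas as pd

# Toy data: courier 1 has weeks 9-11, courier 2 is missing week 10.
toy = pd.DataFrame({'courier': [1] * 5 + [2] * 4,
                    'week':    [7, 8, 9, 10, 11, 7, 8, 9, 11]})

# Label 0 if all of weeks 9, 10 and 11 are present for the courier, else 1.
label = (toy.groupby('courier')['week']
            .apply(lambda w: 0 if {9, 10, 11}.issubset(set(w)) else 1)
            .rename('label'))
print(label.to_dict())  # {1: 0, 2: 1}
```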
We merge the y dataset with our previous dataframe df so that every weekly row carries its courier's label.
df = pd.merge(df, y, on='courier', suffixes=('', ''))
df.head()
After labeling, we are asked to remove week 8, 9, 10 and 11 data to avoid bias in the predictive algorithm. Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem.
# Get names of indexes
indexNames = df[(df.week == 8) | (df.week == 9) | (df.week == 10) | (df.week == 11)].index
# Delete these row indexes from dataFrame
df.drop(indexNames, inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()
We split the dataset into x and y data.
x, y = df.loc[:, df.columns != 'label'], df.label
We also drop columns courier and week since they are non-informative and do not contribute to the target variable.
x.drop(columns = ['courier', 'week'], inplace = True)
x.head()
y.head()
#y.value_counts()
Once we have the data ready to be managed by a machine learning algorithm, we proceed to train the model. We are going to use a logistic regression model. Logistic regression measures the relationship between the categorical dependent variable (target) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution.
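The logistic function mentioned above maps any real-valued score to a probability in (0, 1). A minimal sketch, where the weights, bias and input are made-up numbers purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class given features x, weights w, bias b
# (all values below are hypothetical, not fitted coefficients).
w = np.array([0.5, -1.2])
b = 0.1
x = np.array([2.0, 1.0])
p = sigmoid(x @ w + b)
print(round(float(p), 4))  # ~0.475
```

Logistic regression fits w and b so that these probabilities match the observed labels; predictions then threshold p at 0.5.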
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set. We are going to hold out 20% of the data for testing (evaluating) our classifier.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
As we saw during the exploratory data analysis, most of our numeric features have skewed distributions. Skewed data can cause issues and adversely affect the model. To address this, we can apply a logarithm transformation to bring the distributions closer to normal. Another option is to standardize the data.
Standardization of datasets is a common requirement for many machine learning estimators; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
Even though standardization is not required for logistic regression, it clearly helps the convergence of the algorithm. We have compared performances with and without standardization and found similar scores but faster convergence with standardization.
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
When evaluating different settings (hyperparameters) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. To solve this problem, yet another part of the dataset can be held out as a so-called validation set: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set. However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model. A solution to this problem is a procedure called cross-validation (CV).
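Plain k-fold cross-validation can be sketched with `cross_val_score`; the example below runs on synthetic data from `make_classification` rather than on our training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (x_train, y_train).
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=0)

# 5-fold CV: fit on 4 folds, score on the held-out fold, rotate.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         cv=5, scoring='accuracy')
print(scores.shape)  # (5,) -- one score per fold
print(scores.mean())
```

Grid search, used below, simply repeats this procedure once per candidate hyperparameter combination and keeps the combination with the best mean fold score.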
We are going to use the Grid Search method with cross-validation to tune our hyperparameters.
log_reg = LogisticRegression()
# parameter grid for logistic regression
param_grid = [
    {'penalty': ['l1', 'l2'],       # l1 = lasso, l2 = ridge
     'C': np.logspace(-4, 4, 20),   # inverse of regularization strength; smaller values mean stronger regularization
     'solver': ['liblinear']}       # the 'liblinear' solver supports both L1 and L2 regularization
]
# define the model
grid_search = GridSearchCV(log_reg, param_grid = param_grid, scoring='accuracy', cv = 5, verbose=True, n_jobs=-1)
# n_jobs = -1 means using all processors
# fit on data
best_clf = grid_search.fit(x_train, y_train)
print("Best hyperparameters:", best_clf.best_params_)
print("Accuracy:", best_clf.best_score_)
Grid search trains the algorithm for all possible combinations of the provided set (parameter grid) and measures the performance using the cross validation technique. This validation technique gives assurance that our trained model got most of the patterns from the dataset. On the other hand, randomized search samples a given number of candidates from the set with a specified distribution.
One of the major drawbacks of grid search is that the number of candidate combinations grows exponentially with the number of hyperparameters. In that case we resort to the randomized search method, since its runtime is drastically lower; generally, its performance is slightly worse.
In our case, we were only interested in studying the parameters related to regularization. We ended up with a small set of combinations and the runtime was small. Therefore, we are satisfied with the chosen method.
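For reference, the randomized alternative would look like the sketch below, with the same liblinear parameter space as our grid; `n_iter` caps the number of sampled combinations, and the data here is synthetic rather than our training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for (x_train, y_train).
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=0)

param_dist = {'penalty': ['l1', 'l2'],
              'C': np.logspace(-4, 4, 20),
              'solver': ['liblinear']}

# Sample 10 of the 40 possible combinations instead of trying them all.
rand_search = RandomizedSearchCV(LogisticRegression(), param_dist,
                                 n_iter=10, scoring='accuracy',
                                 cv=5, random_state=0)
rand_search.fit(X_demo, y_demo)
print(rand_search.best_params_)
```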
Notice that model-evaluation tools using cross-validation (such as Grid Search) rely on an internal scoring strategy. This score is not the final evaluation. The final evaluation will be obtained later using a specific metric function.
We are going to use our test set (20% of the data) to perform the final evaluation of our classifier. We use the x_test set to predict the output obtained from our model. Then, we use the y_test set to evaluate the performance using a specific metric function.
Given our output variable (y), which is a categorical binary variable, we need a classification metric. The metric to choose not only depends on the target variable but also on the description of the problem.
First of all, we are going to generate a confusion matrix, which is a specific table layout that allows visualization of the performance of the algorithm. It shows the distribution of true positives, false negatives, false positives and true negatives. Using these proportions, we can derive which metric function to use.
One may think about using the accuracy metric since it is the most intuitive performance measure: simply the ratio of correctly predicted observations to the total observations. However, accuracy is only a reliable measure for roughly balanced datasets. That is clearly not our case, since we have many more 1s than 0s.
One may consider using the Precision or Recall metrics. Using one or the other depends on how much you want to avoid false positives versus false negatives. High precision relates to a low false positive rate; high recall relates to a low false negative rate. In our problem we want to weight both errors equally, but we have an uneven class distribution. In such a case, one should use the F1 Score, which is the harmonic mean of Precision and Recall.
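As a quick numeric check of the harmonic-mean formula, with made-up confusion-matrix counts (not our model's results):

```python
# Toy confusion-matrix counts, purely illustrative.
tp, fp, fn = 80, 10, 30

precision = tp / (tp + fp)  # 80 / 90
recall = tp / (tp + fn)     # 80 / 110
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.889 0.727 0.8
```

Note how the harmonic mean (0.8) sits below the arithmetic mean of precision and recall (~0.81): the F1 score penalizes imbalance between the two.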
# predict some output values
y_pred = best_clf.predict(x_test)
#print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Confusion matrix score: \n', confusion_matrix(y_test, y_pred))
# scikit-learn layout (rows = true class, columns = predicted class):
# true negative  | false positive
# false negative | true positive
print('F1 Score: ', f1_score(y_test, y_pred))